Multihead Attention
1. What Is Scaled Dot-Product Attention?
At its core, attention is about:
"How much should I focus on each word in the input?"
The scaled dot-product attention formula is:

$$
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

Meaning of terms:

- Q = Query: What the current word is looking for
- K = Key: What each word offers
- V = Value: What each word "contains"
- d_k = dimension of the key vector
- softmax(...) turns similarity scores into probabilities

The scaling by √d_k ensures that the dot products don't get too large, which would make the softmax overly peaky (one word gets nearly all the attention).
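To make this concrete, here is a minimal from-scratch sketch in PyTorch (the function name, the toy 3-token input, and the shapes are illustrative assumptions, not a library API):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (..., seq_len, d_k)
    d_k = Q.size(-1)
    # Dot-product similarity between every query and every key, scaled by sqrt(d_k)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    # Softmax turns the scores into attention weights that sum to 1 per query
    weights = F.softmax(scores, dim=-1)
    # Each output is a weighted sum of the value vectors
    return weights @ V, weights

# Toy self-attention: 3 tokens with 4-dimensional embeddings, so Q = K = V = x
x = torch.rand(3, 4)
out, weights = scaled_dot_product_attention(x, x, x)
print(weights.sum(dim=-1))  # each row of attention weights sums to 1
```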
2. What Is Multi-Head Attention?
Instead of running just one attention mechanism, we run several in parallel.
Each "head" gets its own Q, K, V matrices (via learned linear layers), allowing it to:
- Learn different relationships (e.g., syntax, grammar, position)
- Attend to different tokens in the same input
Example:
Suppose your input embeddings are 4-dimensional, and you use 2 heads:
- Each head gets 2-dimensional Q, K, V vectors (the 4 dimensions are split evenly across the heads)
- Attention is computed separately in each head
- Outputs are concatenated and passed through a final linear layer
This gives the model more flexibility and a richer representation.
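A rough sketch of the 4-dimensional, 2-head example above (the 3-token input, lack of batching, and the projection layers are illustrative assumptions standing in for the learned linear layers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, num_heads = 4, 2
head_dim = embed_dim // num_heads                # 2 dimensions per head

x = torch.rand(3, embed_dim)                     # 3 tokens, 4-dim embeddings
qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)   # learned Q, K, V projections
out_proj = nn.Linear(embed_dim, embed_dim)       # final linear layer

Q, K, V = qkv_proj(x).chunk(3, dim=-1)
# Split each 4-dim vector into 2 heads of 2 dims: (num_heads, seq_len, head_dim)
Q, K, V = (t.view(3, num_heads, head_dim).transpose(0, 1) for t in (Q, K, V))

# Attention is computed separately in each head
weights = F.softmax(Q @ K.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
heads = weights @ V                              # (num_heads, seq_len, head_dim)

# Concatenate the heads back to 4 dims and mix them with the final linear layer
out = out_proj(heads.transpose(0, 1).reshape(3, embed_dim))
print(out.shape)                                 # torch.Size([3, 4])
```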
3. What's a Transformer Encoder Layer?
Each encoder layer in a Transformer consists of:
- Multi-head self-attention
  - Input goes through multiple attention heads
  - The output is context-aware embeddings
- Add & Norm
  - Add the original input (residual connection)
  - Normalize the result (layer normalization)
- Feed-forward Network (FFN)
  - A linear → activation → linear block applied to each position
  - Adds extra non-linearity
- Add & Norm again
  - Another residual + normalization
You can stack multiple encoder layers to allow deeper understanding of patterns.
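Here is a hand-rolled sketch of those four steps (a post-norm layer with made-up sizes; the built-in nn.TransformerEncoderLayer used in section 5 packages the same structure):

```python
import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    def __init__(self, d_model=4, nhead=2, dim_ff=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                # linear -> activation -> linear
            nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)         # 1. multi-head self-attention
        x = self.norm1(x + attn_out)             # 2. add & norm
        x = self.norm2(x + self.ffn(x))          # 3. FFN, then 4. add & norm again
        return x

layer = EncoderLayerSketch()
x = torch.rand(1, 3, 4)                          # (batch, seq_len, d_model)
print(layer(x).shape)                            # torch.Size([1, 3, 4])
```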
4. What About Transformers for Translation?
In translation, we use:
- Encoder: Processes the source sentence (e.g., French)
- Decoder: Generates the target sentence (e.g., English)
Decoder Specialties:
- Uses masked self-attention to prevent cheating (no peeking at future words)
- Uses cross-attention: queries come from the decoder, but keys and values come from the encoder output
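Both ideas appear directly in PyTorch's built-in nn.Transformer. A minimal sketch with toy, made-up shapes (a real translation model would also need token embeddings, positional encodings, and an output vocabulary projection):

```python
import torch
import torch.nn as nn

d_model, nhead = 4, 2
src = torch.rand(1, 5, d_model)   # source sentence, 5 tokens (batch_first)
tgt = torch.rand(1, 3, d_model)   # target tokens generated so far, 3 tokens

model = nn.Transformer(d_model=d_model, nhead=nhead,
                       num_encoder_layers=2, num_decoder_layers=2,
                       dim_feedforward=16, batch_first=True)

# Causal mask: position i may only attend to positions <= i (no peeking ahead)
tgt_mask = model.generate_square_subsequent_mask(tgt.size(1))

# Internally, the decoder layers apply masked self-attention to tgt and
# cross-attention to the encoder output
out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)                  # torch.Size([1, 3, 4])
```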
5. How Do You Build This in PyTorch?
Multihead Attention
Removed "python CopyEdit" as it's not part of the code
```python
import torch
import torch.nn as nn

# Toy tensors of shape (seq_len, batch_size, embed_dim), since batch_first=False
Q = K = V = torch.rand(3, 1, 4)

attention = nn.MultiheadAttention(embed_dim=4, num_heads=2, batch_first=False)
output, weights = attention(Q, K, V)  # output: (3, 1, 4), weights: (1, 3, 3)
```
Removed "python CopyEdit"
```python
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=4, nhead=2)
transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)

seq_len, batch_size, embed_dim = 3, 1, 4
x = torch.rand(seq_len, batch_size, embed_dim)
out = transformer(x)
```
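The output out has the same (seq_len, batch_size, embed_dim) shape as the input x; stacking more encoder layers refines the embeddings but never changes their shape.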
Each layer will apply:
- Multi-head attention
- Add & Norm
- Feedforward
- Add & Norm again
6. Recap
| Concept | Key Idea |
| --- | --- |
| Attention | Focuses on important words via dot-product |
| Scaled Dot Product | Softmax on similarity scores, scaled by √d_k |
| Multi-Head Attention | Parallel attention heads learn different patterns |
| Encoder | Stacks attention + FFN with normalization |
| Decoder | Adds masking and cross-attention |
| Transformers | Stack layers → deeper representations |